Automatic Ticket Classification - EDA


  • Import & Analyse the data.

  • Check for Incomplete Information

  • Target Class Distribution

  • The target class distribution is heavily imbalanced: most tickets are assigned to Group 0, and even after excluding Group 0, the remaining groups are still imbalanced.

  • Choosing a Metric to benchmark model performance

Since we want to classify tickets into all functional groups, and the functional groups are given equal importance, we choose AUC as the final metric to score model performance.
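With multiple groups of equal importance, "AUC" here means the one-vs-rest ROC AUC macro-averaged across groups, so every group contributes equally to the score. A minimal sketch using scikit-learn's `roc_auc_score` (the labels and probabilities below are toy values, not the project's data):

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy example: 3 functional groups, predicted class probabilities per ticket.
y_true = np.array([0, 0, 1, 2, 1, 0])
y_prob = np.array([
    [0.7, 0.2, 0.1],
    [0.6, 0.3, 0.1],
    [0.2, 0.6, 0.2],
    [0.1, 0.2, 0.7],
    [0.3, 0.5, 0.2],
    [0.5, 0.3, 0.2],
])

# One-vs-rest AUC, macro-averaged so every group counts equally.
auc = roc_auc_score(y_true, y_prob, multi_class="ovr", average="macro")
print(round(auc, 4))  # 1.0 for this perfectly-separated toy example
```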

  • Outlier Analysis

  • Most descriptions are between 6 and 28 words long, with a median of 41 words (106 characters) and a mean of 27.2 words; a relatively small number of outliers range up to 1625 words!
  • Most short descriptions are between 4 and 9 words long, with a median of 6 words (41 characters) and a mean of 6.92 words; relatively few outliers range up to 28 words.
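The length statistics above come from per-ticket word counts. A minimal sketch of computing them (the descriptions below are made-up examples, and the 30-word outlier cutoff is illustrative, not the notebook's threshold):

```python
from statistics import mean, median

# Made-up sample descriptions standing in for the ticket data.
descriptions = [
    "unable to login to vpn from home office please advise",
    "outlook crashes when opening attachments",
    "password reset",
    "ticket raised for printer on floor 3 not responding to jobs",
]

word_counts = [len(d.split()) for d in descriptions]
print("min:", min(word_counts), "max:", max(word_counts))
print("median:", median(word_counts), "mean:", round(mean(word_counts), 2))

# Flag outliers with a simple length threshold (illustrative cutoff).
outliers = [d for d in descriptions if len(d.split()) > 30]
print("outliers:", len(outliers))
```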

  • Fix Text Encoding

  • Keyword Extraction

  • Word Frequency Distributions & WordClouds

  • Stopwords and anchor words like 'From:' and 'Received:' have to be stripped out.
  • Many stopwords occur very frequently in the dataset; we may need stopword removal in pre-processing if it improves model performance.
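A minimal sketch of stripping anchor words and stopwords (the anchor patterns and stopword set below are small illustrative subsets; the notebook derives its lists from the frequency analysis above):

```python
import re

# Illustrative anchor patterns taken from email-header noise like 'From:'.
ANCHORS = re.compile(r"\b(from|received|to|subject)\s*:", flags=re.IGNORECASE)
STOPWORDS = {"the", "a", "an", "is", "to", "of", "and", "in"}

def clean(text: str) -> str:
    text = ANCHORS.sub(" ", text)            # strip email-header anchors
    tokens = re.findall(r"[a-z']+", text.lower())
    return " ".join(t for t in tokens if t not in STOPWORDS)

print(clean("From: helpdesk  The printer in the lobby is out of toner"))
# -> helpdesk printer lobby out toner
```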

  • Descriptions WordCloud

  • Short Descriptions WordCloud

  • Group 0 Descriptions WordCloud

  • Other Groups Descriptions WordCloud

  • Short Descriptions

  • Descriptions

  • Description Lengths vs. Functional Group

  • Language Detection

https://fasttext.cc/docs/en/language-identification.html

  • Pre-Processing

Pipeline

  • TODO: Language Translation

  • TODO: Outage Questionnaires

  • TODO: Security/Event Logs

TODO: dig deeper into security logs and cyber security issues handled by 2, 39, 12, ...

  • Parse Emails
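Many descriptions are raw forwarded emails, so the stdlib `email` parser can separate the useful body text from the header noise. A minimal sketch with a made-up message:

```python
from email import message_from_string
from email.policy import default

# Made-up sample message standing in for a forwarded-email description.
raw = """\
From: alice@example.com
To: helpdesk@example.com
Subject: VPN access

Hi team, I cannot connect to the VPN since this morning.
"""

msg = message_from_string(raw, policy=default)
body = msg.get_content().strip()   # body text without headers
print(msg["Subject"])
print(body)
```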

  • Clean up caller ids in description

  • Clean Irrelevant Information

  • Clean Anchors

  • Gibberish Removal

ML Model Building

Visualizing the types of queries by group

Data pre-processing

LightGBM with RandomizedSearchCV

Random Forest with RandomizedSearchCV

XGBoost with RandomizedSearchCV
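Each of the three models is tuned the same way: sample hyper-parameter combinations with `RandomizedSearchCV` and score them by cross-validated AUC. A minimal sketch with Random Forest on synthetic stand-in features (the parameter grid and data are illustrative; the notebook runs the same pattern for LightGBM and XGBoost):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic stand-in for the TfIdf features.
X, y = make_classification(n_samples=200, n_features=20, n_informative=5,
                           n_classes=3, random_state=42)

param_dist = {
    "n_estimators": [50, 100, 200],
    "max_depth": [None, 5, 10],
    "min_samples_leaf": [1, 2, 4],
}

# Sample 5 random combinations, scored by one-vs-rest AUC (the chosen metric).
search = RandomizedSearchCV(
    RandomForestClassifier(random_state=42),
    param_distributions=param_dist,
    n_iter=5, cv=3, scoring="roc_auc_ovr", random_state=42,
)
search.fit(X, y)
print(search.best_params_)
```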

Based on the word clouds, we notice some overlap between groups L1 and L3, hence we collapse the targets further into 2 classes.
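A minimal sketch of collapsing the multi-class targets into two classes, here majority group vs. everything else (the group labels below are illustrative; the notebook's actual mapping follows the word-cloud analysis):

```python
# Collapse multi-class targets into binary: Group 0 vs. all other groups.
groups = [0, 3, 0, 12, 39, 0, 2]           # illustrative group labels
binary = [0 if g == 0 else 1 for g in groups]
print(binary)  # [0, 1, 0, 1, 1, 0, 1]
```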

Deep Learning Model Building for All Classes

  • OBJECTIVE:

Use the text data to build simple feed-forward Neural Nets and benchmark against the base ML models.

  • Import & Analyse the data.

  • Tokenize and pad sequences
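Tokenize-and-pad maps each word to an integer index and right-pads every sequence to a fixed length so the network sees uniform input. The notebook uses Keras' `Tokenizer` and `pad_sequences`; this plain-Python sketch shows the same idea:

```python
# Made-up corpus standing in for the ticket descriptions.
texts = ["vpn login failed", "printer not responding", "vpn down"]

# Build a word index from the corpus (0 is reserved for padding).
vocab = {}
for t in texts:
    for w in t.split():
        vocab.setdefault(w, len(vocab) + 1)

def encode(text, maxlen=4):
    seq = [vocab.get(w, 0) for w in text.split()][:maxlen]
    return seq + [0] * (maxlen - len(seq))  # right-pad to fixed length

padded = [encode(t) for t in texts]
print(padded)  # [[1, 2, 3, 0], [4, 5, 6, 0], [1, 7, 0, 0]]
```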

  • GloVe Embeddings

  • Simple Feed-Forward Neural Net

  • This model is clearly overfitting; we will add regularization in the next iteration.

  • Add Dropout Layer
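Dropout regularizes by randomly zeroing activations during training; the "inverted" variant scales the survivors by 1/(1-p) so the expected activation is unchanged and inference needs no correction. A minimal NumPy illustration of the mechanism (the notebook uses Keras' built-in `Dropout` layer):

```python
import numpy as np

rng = np.random.default_rng(0)

def dropout(x, p=0.5, training=True):
    """Inverted dropout: zero with probability p, scale survivors by 1/(1-p)."""
    if not training:
        return x  # dropout is a no-op at inference time
    mask = rng.random(x.shape) >= p
    return x * mask / (1.0 - p)

x = np.ones((2, 4))
print(dropout(x, p=0.5))          # some entries zeroed, survivors scaled to 2.0
print(dropout(x, training=False)) # unchanged at inference
```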

  • Use pre-trained embeddings

  • LSTM

  • Bi-Directional LSTM

  • CNN (Dimensionality Reduction) + LSTM

  • CNN (Dimensionality Reduction) + Bi-Directional LSTM

  • Use TfIdf vectors instead of Embedding Layer + Feature Selection

  • Resultant Metrics:

| Model | Test Accuracy |
| --- | --- |
| Simple Feed-Forward Neural Net | 60.40 |
| Feed-Forward NN + Batch Norm | 63.43 |
| Feed-Forward NN + Dropout | 64.73 |
| Feed-Forward NN + Pre-trained GloVe embeddings | 61.53 |
| LSTM | 49.91 |
| Bi-Directional LSTM | 65.87 |
| Convolution Blocks (Dimensionality Reduction) + LSTM | 54.71 |
| Convolution Blocks (Dimensionality Reduction) + Bi-LSTM | 59.87 |
| TfIdf Vectors + Feature Selection + Feed-forward Neural Net | 66.80 |

Deep Learning Model Building for Binary Classification

  • OBJECTIVE:

Use the text data to build simple feed-forward Neural Nets and benchmark against the base ML models.

  • Import the clean data.

  • Tokenize and pad sequences

  • GloVe Embeddings

  • Simple Feed-Forward Neural Net

  • This model is clearly overfitting; we will add regularization in the next iteration.

  • Simple Feed-Forward Neural Net + Batch Normalization

  • Simple Feed-Forward Neural Net + Dropout

  • Use pre-trained embeddings

  • LSTM

  • Bi-Directional LSTM

  • CNN (Dimensionality Reduction) + LSTM

  • CNN (Dimensionality Reduction) + Bi-Directional LSTM

  • Use TfIdf vectors instead of Embedding Layer + Feature Selection

  • Use TfIdf vectors instead of Embedding Layer + Feature Selection + Stratified KFold Training
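Because the classes are imbalanced, stratified folds keep the class ratio the same in every split, so each fold's validation score is comparable. A minimal sketch with scikit-learn's `StratifiedKFold` (the 80/20 toy labels below mimic the imbalance, not the actual data):

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

y = np.array([0] * 8 + [1] * 2)          # 80/20 imbalance, toy labels
X = np.arange(len(y)).reshape(-1, 1)

skf = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
    # Every validation fold preserves the 20% positive rate.
    print(f"fold {fold}: positive ratio in val = {y[val_idx].mean():.1f}")
```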

  • Metrics:

| Model | Test Accuracy |
| --- | --- |
| Simple Feed-Forward Net using Embedding Layer | 86.25% |
| Feed-Forward NN + Batch Norm | 83.76% |
| Feed-Forward NN + Dropout | 85.83% |
| Feed-Forward NN + Pre-trained GloVe embeddings | 82.75% |
| LSTM | 65.44% |
| Bi-Directional LSTM | 85.24% |
| Convolution Blocks (Dimensionality Reduction) + LSTM | 84.47% |
| Convolution Blocks (Dimensionality Reduction) + Bi-LSTM | 85.00% |
| TfIdf Vectors + Feature Selection + Feed-forward Neural Net | 85.77% |
| Stratified KFold Validation + TfIdf Vectors + Feature Selection + Feed-forward Neural Net | 86.40% |